{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Choosing String Comparators\n",
    "\n",
    "When building a Splink model, one of the most important aspects is defining the [`Comparisons`](../comparisons/comparisons_and_comparison_levels.md) and [`Comparison Levels`](../comparisons/comparisons_and_comparison_levels.md) that the model will train on. Each `Comparison Level` within a `Comparison` should contain a different amount of evidence that two records are a match, to which the model can assign a match weight. When considering different amounts of evidence for the model, it is helpful to explore fuzzy matching as a way of distinguishing strings that are similar, but not the same, as one another.\n",
    "\n",
    "This guide shows how Splink's string comparators perform in different situations, to help you choose the most appropriate comparator for a given column as well as the most appropriate threshold (or thresholds).\n",
    "For descriptions and examples of each string comparator available in Splink, see the dedicated [topic guide](./comparators.md)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What options are available when comparing strings?\n",
    "\n",
    "There are three main classes of string comparator that are considered within Splink:\n",
    "\n",
    "1. **String Similarity Scores**  \n",
    "2. **String Distance Scores**  \n",
    "3. **Phonetic Matching**  \n",
    "\n",
    "where  \n",
    "\n",
    "**String Similarity Scores** are scores between 0 and 1 indicating how similar two strings are. 0 represents two completely dissimilar strings and 1 represents identical strings. E.g. [Jaro-Winkler Similarity](comparators.md#jaro-winkler-similarity).  \n",
    "\n",
    "**String Distance Scores** are integer distances, counting the minimum number of edit operations needed to convert one string into another. A lower string distance indicates more similar strings. E.g. [Levenshtein Distance](comparators.md#levenshtein-distance).  \n",
    "\n",
    "**Phonetic Matching** checks whether two strings are phonetically similar. The two strings are passed through a [phonetic transformation algorithm](phonetic.md) and the resulting phonetic codes are then compared. E.g. [Double Metaphone](phonetic.md#double-metaphone)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparing String Similarity and Distance Scores\n",
    "\n",
    "Splink's `splink.exploratory` package contains a `similarity_analysis` module with helper functions for comparing string similarity and distance scores, which can help when choosing the most appropriate fuzzy matching function.\n",
    "\n",
    "For a single pair of strings, the `comparator_score` function returns the scores for all of the available comparators. E.g. consider a simple transposition of the first two characters, \"Richard\" vs \"iRchard\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string1</th>\n",
       "      <th>string2</th>\n",
       "      <th>levenshtein_distance</th>\n",
       "      <th>damerau_levenshtein_distance</th>\n",
       "      <th>jaro_similarity</th>\n",
       "      <th>jaro_winkler_similarity</th>\n",
       "      <th>jaccard_similarity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Richard</td>\n",
       "      <td>iRchard</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0.95</td>\n",
       "      <td>0.95</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   string1  string2  levenshtein_distance  damerau_levenshtein_distance  \\\n",
       "0  Richard  iRchard                     2                             1   \n",
       "\n",
       "   jaro_similarity  jaro_winkler_similarity  jaccard_similarity  \n",
       "0             0.95                     0.95                 1.0  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from splink.exploratory import similarity_analysis as sa\n",
    "\n",
    "sa.comparator_score(\"Richard\", \"iRchard\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now consider a collection of common variations of the name \"Richard\" - which comparators will consider these variations as sufficiently similar to \"Richard\"?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string1</th>\n",
       "      <th>string2</th>\n",
       "      <th>error_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Richard</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Richard</td>\n",
       "      <td>ichard</td>\n",
       "      <td>Deletion</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Richar</td>\n",
       "      <td>Deletion</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Richard</td>\n",
       "      <td>iRchard</td>\n",
       "      <td>Transposition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Richadr</td>\n",
       "      <td>Transposition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Rich</td>\n",
       "      <td>Shortening</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Rick</td>\n",
       "      <td>Nickname/Alias</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Ricky</td>\n",
       "      <td>Nickname/Alias</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Dick</td>\n",
       "      <td>Nickname/Alias</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Rico</td>\n",
       "      <td>Nickname/Alias</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Rachael</td>\n",
       "      <td>Different Name</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Stephen</td>\n",
       "      <td>Different Name</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    string1  string2      error_type\n",
       "0   Richard  Richard            None\n",
       "1   Richard   ichard        Deletion\n",
       "2   Richard   Richar        Deletion\n",
       "3   Richard  iRchard   Transposition\n",
       "4   Richard  Richadr   Transposition\n",
       "5   Richard     Rich      Shortening\n",
       "6   Richard     Rick  Nickname/Alias\n",
       "7   Richard    Ricky  Nickname/Alias\n",
       "8   Richard     Dick  Nickname/Alias\n",
       "9   Richard     Rico  Nickname/Alias\n",
       "10  Richard  Rachael  Different Name\n",
       "11  Richard  Stephen  Different Name"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data = [\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Richard\", \"error_type\": \"None\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"ichard\", \"error_type\": \"Deletion\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Richar\", \"error_type\": \"Deletion\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"iRchard\", \"error_type\": \"Transposition\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Richadr\", \"error_type\": \"Transposition\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Rich\", \"error_type\": \"Shortening\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Rick\", \"error_type\": \"Nickname/Alias\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Ricky\", \"error_type\": \"Nickname/Alias\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Dick\", \"error_type\": \"Nickname/Alias\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Rico\", \"error_type\": \"Nickname/Alias\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Rachael\", \"error_type\": \"Different Name\"},\n",
    "    {\"string1\": \"Richard\", \"string2\": \"Stephen\", \"error_type\": \"Different Name\"},\n",
    "]\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `comparator_score_chart` function takes a collection of string pairs, along with the names of the two columns to compare, and charts how similar each pair is according to the available string similarity and distance metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<style>\n",
       "  #altair-viz-22c4436601894be7b50e66680e08528e.vega-embed {\n",
       "    width: 100%;\n",
       "    display: flex;\n",
       "  }\n",
       "\n",
       "  #altair-viz-22c4436601894be7b50e66680e08528e.vega-embed details,\n",
       "  #altair-viz-22c4436601894be7b50e66680e08528e.vega-embed details summary {\n",
       "    position: relative;\n",
       "  }\n",
       "</style>\n",
       "<div id=\"altair-viz-22c4436601894be7b50e66680e08528e\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-22c4436601894be7b50e66680e08528e\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-22c4436601894be7b50e66680e08528e\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.17.0?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"5.17.0\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 300, \"continuousHeight\": 300, \"discreteHeight\": {\"step\": 30}, \"discreteWidth\": {\"step\": 40}}}, \"hconcat\": [{\"layer\": [{\"mark\": {\"type\": \"rect\"}, \"encoding\": {\"color\": {\"field\": \"score\", \"legend\": null, \"scale\": {\"domain\": [0, 1], \"scheme\": \"greenblue\"}, \"type\": \"quantitative\"}, \"x\": {\"field\": \"comparator\", \"title\": null, \"type\": \"ordinal\"}, \"y\": {\"axis\": {\"titleFontSize\": 14}, \"field\": \"strings_to_compare\", \"title\": \"String comparison\", \"type\": \"ordinal\"}}, \"title\": \"Similarity\"}, {\"mark\": {\"type\": \"text\", \"baseline\": \"middle\"}, \"encoding\": {\"size\": {\"field\": \"score\", \"legend\": null, \"scale\": {\"range\": [8, 14]}}, \"text\": {\"field\": \"score\", \"format\": \".2f\", \"type\": \"quantitative\"}, \"x\": {\"axis\": {\"labelFontSize\": 12}, \"field\": \"comparator\", \"type\": \"ordinal\"}, \"y\": {\"field\": \"strings_to_compare\", \"type\": \"ordinal\"}}}], \"data\": {\"name\": \"data-similarity\"}}, {\"layer\": [{\"mark\": {\"type\": \"rect\"}, \"encoding\": {\"color\": {\"field\": \"score\", \"legend\": null, \"scale\": {\"reverse\": true, \"scheme\": \"yelloworangered\"}, \"type\": \"quantitative\"}, \"x\": {\"axis\": {\"labelFontSize\": 12}, \"field\": \"comparator\", \"title\": null, \"type\": \"ordinal\"}, \"y\": {\"axis\": null, \"field\": \"strings_to_compare\", \"type\": \"ordinal\"}}, \"title\": \"Distance\"}, {\"mark\": {\"type\": \"text\", \"baseline\": \"middle\"}, \"encoding\": {\"size\": {\"field\": \"score\", \"legend\": null, \"scale\": {\"range\": [8, 14], \"reverse\": true}}, \"text\": {\"field\": \"score\", \"type\": \"quantitative\"}, \"x\": {\"field\": \"comparator\", \"type\": \"ordinal\"}, \"y\": {\"field\": \"strings_to_compare\", \"type\": \"ordinal\"}}}], \"data\": {\"name\": \"data-distance\"}}], \"datasets\": {\"data-similarity\": \"[{\\\"strings_to_compare\\\":\\\"Richard, 
Richard\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, iRchard\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, Richadr\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, Rich\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.86},{\\\"strings_to_compare\\\":\\\"Richard, Rick\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.73},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.68},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.6},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.73},{\\\"strings_to_compare\\\":\\\"Richard, Rachael\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.71},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.43},{\\\"strings_to_compare\\\":\\\"Richard, Richard\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.97},{\\\"strings_to_compare\\\":\\\"Richard, iRchard\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, Richadr\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.97},{\\\"strings_to_compare\\\":\\\"Richard, Rich\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.91},{\\\"strings_to_compare\\\":\\\"Richard, 
Rick\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.81},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.68},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.6},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.81},{\\\"strings_to_compare\\\":\\\"Richard, Rachael\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.74},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.43},{\\\"strings_to_compare\\\":\\\"Richard, Richard\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.86},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.86},{\\\"strings_to_compare\\\":\\\"Richard, iRchard\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Richadr\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Rich\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.57},{\\\"strings_to_compare\\\":\\\"Richard, Rick\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.38},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.33},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.22},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.38},{\\\"strings_to_compare\\\":\\\"Richard, Rachael\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.44},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.08}]\", \"data-distance\": \"[{\\\"strings_to_compare\\\":\\\"Richard, 
Richard\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":0.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, iRchard\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":2.0},{\\\"strings_to_compare\\\":\\\"Richard, Richadr\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":2.0},{\\\"strings_to_compare\\\":\\\"Richard, Rich\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":3.0},{\\\"strings_to_compare\\\":\\\"Richard, Rick\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":5.0},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Rachael\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":3.0},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":7.0},{\\\"strings_to_compare\\\":\\\"Richard, Richard\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":0.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, iRchard\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Richadr\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, 
Rich\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":3.0},{\\\"strings_to_compare\\\":\\\"Richard, Rick\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":5.0},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Rachael\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":3.0},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":7.0}]\"}, \"resolve\": {\"scale\": {\"color\": \"independent\", \"size\": \"independent\", \"y\": \"shared\"}}, \"title\": {\"text\": \"Heatmaps of string comparison metrics\", \"anchor\": \"middle\", \"fontSize\": 16}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.9.3.json\"}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.HConcatChart(...)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sa.comparator_score_chart(data, \"string1\", \"string2\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see that all of the metrics pick up transcription errors (\"Richadr\", \"Richar\", \"iRchard\") reasonably well, scoring them as close to \"Richard\". However, for nicknames/aliases (\"Rick\", \"Ricky\", \"Rico\"), simple metrics such as Jaccard, Levenshtein and Damerau-Levenshtein tend to be less useful. The same can be said for name shortenings (\"Rich\"), but to a lesser extent than for more complex nicknames. Even more subtle metrics like Jaro and Jaro-Winkler still struggle to identify less obvious nicknames/aliases such as \"Dick\"."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you would prefer the underlying dataframe instead of the chart, use the `comparator_score_df` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string1</th>\n",
       "      <th>string2</th>\n",
       "      <th>levenshtein_distance</th>\n",
       "      <th>damerau_levenshtein_distance</th>\n",
       "      <th>jaro_similarity</th>\n",
       "      <th>jaro_winkler_similarity</th>\n",
       "      <th>jaccard_similarity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Richard</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Richard</td>\n",
       "      <td>ichard</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0.95</td>\n",
       "      <td>0.95</td>\n",
       "      <td>0.86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Richar</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0.95</td>\n",
       "      <td>0.97</td>\n",
       "      <td>0.86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Richard</td>\n",
       "      <td>iRchard</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0.95</td>\n",
       "      <td>0.95</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Richadr</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0.95</td>\n",
       "      <td>0.97</td>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Rich</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0.86</td>\n",
       "      <td>0.91</td>\n",
       "      <td>0.57</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Rick</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>0.73</td>\n",
       "      <td>0.81</td>\n",
       "      <td>0.38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Ricky</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>0.68</td>\n",
       "      <td>0.68</td>\n",
       "      <td>0.33</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Dick</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>0.60</td>\n",
       "      <td>0.60</td>\n",
       "      <td>0.22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Rico</td>\n",
       "      <td>4</td>\n",
       "      <td>4</td>\n",
       "      <td>0.73</td>\n",
       "      <td>0.81</td>\n",
       "      <td>0.38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Rachael</td>\n",
       "      <td>3</td>\n",
       "      <td>3</td>\n",
       "      <td>0.71</td>\n",
       "      <td>0.74</td>\n",
       "      <td>0.44</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Richard</td>\n",
       "      <td>Stephen</td>\n",
       "      <td>7</td>\n",
       "      <td>7</td>\n",
       "      <td>0.43</td>\n",
       "      <td>0.43</td>\n",
       "      <td>0.08</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    string1  string2  levenshtein_distance  damerau_levenshtein_distance  \\\n",
       "0   Richard  Richard                     0                             0   \n",
       "1   Richard   ichard                     1                             1   \n",
       "2   Richard   Richar                     1                             1   \n",
       "3   Richard  iRchard                     2                             1   \n",
       "4   Richard  Richadr                     2                             1   \n",
       "5   Richard     Rich                     3                             3   \n",
       "6   Richard     Rick                     4                             4   \n",
       "7   Richard    Ricky                     4                             4   \n",
       "8   Richard     Dick                     5                             5   \n",
       "9   Richard     Rico                     4                             4   \n",
       "10  Richard  Rachael                     3                             3   \n",
       "11  Richard  Stephen                     7                             7   \n",
       "\n",
       "    jaro_similarity  jaro_winkler_similarity  jaccard_similarity  \n",
       "0              1.00                     1.00                1.00  \n",
       "1              0.95                     0.95                0.86  \n",
       "2              0.95                     0.97                0.86  \n",
       "3              0.95                     0.95                1.00  \n",
       "4              0.95                     0.97                1.00  \n",
       "5              0.86                     0.91                0.57  \n",
       "6              0.73                     0.81                0.38  \n",
       "7              0.68                     0.68                0.33  \n",
       "8              0.60                     0.60                0.22  \n",
       "9              0.73                     0.81                0.38  \n",
       "10             0.71                     0.74                0.44  \n",
       "11             0.43                     0.43                0.08  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sa.comparator_score_df(data, \"string1\", \"string2\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Choosing thresholds\n",
    "\n",
    "We can add distance and similarity thresholds to the comparators to see what strings would be included in a given comparison level:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<style>\n",
       "  #altair-viz-6e8c4e1721194630be10a45883fa40d2.vega-embed {\n",
       "    width: 100%;\n",
       "    display: flex;\n",
       "  }\n",
       "\n",
       "  #altair-viz-6e8c4e1721194630be10a45883fa40d2.vega-embed details,\n",
       "  #altair-viz-6e8c4e1721194630be10a45883fa40d2.vega-embed details summary {\n",
       "    position: relative;\n",
       "  }\n",
       "</style>\n",
       "<div id=\"altair-viz-6e8c4e1721194630be10a45883fa40d2\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-6e8c4e1721194630be10a45883fa40d2\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-6e8c4e1721194630be10a45883fa40d2\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.17.0?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"5.17.0\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 300, \"continuousHeight\": 300, \"discreteHeight\": {\"step\": 30}, \"discreteWidth\": {\"step\": 40}}}, \"hconcat\": [{\"layer\": [{\"mark\": {\"type\": \"rect\"}, \"encoding\": {\"color\": {\"condition\": {\"test\": \"datum.score > similarity_threshold\", \"value\": \"lightgreen\"}, \"value\": \"lightgrey\"}, \"x\": {\"field\": \"comparator\", \"title\": null, \"type\": \"ordinal\"}, \"y\": {\"axis\": {\"titleFontSize\": 14}, \"field\": \"strings_to_compare\", \"title\": \"String comparison\", \"type\": \"ordinal\"}}, \"title\": {\"text\": \"Similarity\", \"subtitle\": \">= 0.8\"}}, {\"mark\": {\"type\": \"text\", \"baseline\": \"middle\"}, \"encoding\": {\"opacity\": {\"condition\": {\"test\": \"datum.score > 0.9\", \"value\": 1}, \"value\": 0.5}, \"size\": {\"field\": \"score\", \"legend\": null, \"scale\": {\"range\": [8, 14]}}, \"text\": {\"field\": \"score\", \"format\": \".2f\", \"type\": \"quantitative\"}, \"x\": {\"axis\": {\"labelFontSize\": 12}, \"field\": \"comparator\", \"type\": \"ordinal\"}, \"y\": {\"field\": \"strings_to_compare\", \"type\": \"ordinal\"}}}], \"data\": {\"name\": \"data-similarity\"}}, {\"layer\": [{\"mark\": {\"type\": \"rect\"}, \"encoding\": {\"color\": {\"condition\": {\"test\": \"datum.score <= distance_threshold\", \"value\": \"lightgreen\"}, \"value\": \"lightgrey\"}, \"x\": {\"axis\": {\"labelFontSize\": 12}, \"field\": \"comparator\", \"title\": null, \"type\": \"ordinal\"}, \"y\": {\"axis\": null, \"field\": \"strings_to_compare\", \"type\": \"ordinal\"}}, \"title\": {\"text\": \"Distance\", \"subtitle\": \"<= 2\"}}, {\"mark\": {\"type\": \"text\", \"baseline\": \"middle\"}, \"encoding\": {\"opacity\": {\"condition\": {\"test\": \"datum.score <= 2\", \"value\": 1}, \"value\": 0.5}, \"size\": {\"field\": \"score\", \"legend\": null, \"scale\": {\"range\": [8, 14], \"reverse\": true}}, \"text\": {\"field\": \"score\", \"type\": \"quantitative\"}, \"x\": {\"field\": 
\"comparator\", \"type\": \"ordinal\"}, \"y\": {\"field\": \"strings_to_compare\", \"type\": \"ordinal\"}}}], \"data\": {\"name\": \"data-distance\"}}], \"datasets\": {\"data-similarity\": \"[{\\\"strings_to_compare\\\":\\\"Richard, Richard\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, iRchard\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, Richadr\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, Rich\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.86},{\\\"strings_to_compare\\\":\\\"Richard, Rick\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.73},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.68},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.6},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.73},{\\\"strings_to_compare\\\":\\\"Richard, Rachael\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.71},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"jaro\\\",\\\"score\\\":0.43},{\\\"strings_to_compare\\\":\\\"Richard, Richard\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.97},{\\\"strings_to_compare\\\":\\\"Richard, iRchard\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.95},{\\\"strings_to_compare\\\":\\\"Richard, 
Richadr\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.97},{\\\"strings_to_compare\\\":\\\"Richard, Rich\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.91},{\\\"strings_to_compare\\\":\\\"Richard, Rick\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.81},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.68},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.6},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.81},{\\\"strings_to_compare\\\":\\\"Richard, Rachael\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.74},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"jaro_winkler\\\",\\\"score\\\":0.43},{\\\"strings_to_compare\\\":\\\"Richard, Richard\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.86},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.86},{\\\"strings_to_compare\\\":\\\"Richard, iRchard\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Richadr\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Rich\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.57},{\\\"strings_to_compare\\\":\\\"Richard, Rick\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.38},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.33},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.22},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.38},{\\\"strings_to_compare\\\":\\\"Richard, 
Rachael\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.44},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"jaccard\\\",\\\"score\\\":0.08}]\", \"data-distance\": \"[{\\\"strings_to_compare\\\":\\\"Richard, Richard\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":0.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, iRchard\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":2.0},{\\\"strings_to_compare\\\":\\\"Richard, Richadr\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":2.0},{\\\"strings_to_compare\\\":\\\"Richard, Rich\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":3.0},{\\\"strings_to_compare\\\":\\\"Richard, Rick\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":5.0},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Rachael\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":3.0},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"levenshtein\\\",\\\"score\\\":7.0},{\\\"strings_to_compare\\\":\\\"Richard, Richard\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":0.0},{\\\"strings_to_compare\\\":\\\"Richard, ichard\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Richar\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, 
iRchard\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Richadr\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":1.0},{\\\"strings_to_compare\\\":\\\"Richard, Rich\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":3.0},{\\\"strings_to_compare\\\":\\\"Richard, Rick\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Ricky\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Dick\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":5.0},{\\\"strings_to_compare\\\":\\\"Richard, Rico\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":4.0},{\\\"strings_to_compare\\\":\\\"Richard, Rachael\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":3.0},{\\\"strings_to_compare\\\":\\\"Richard, Stephen\\\",\\\"comparator\\\":\\\"damerau_levenshtein\\\",\\\"score\\\":7.0}]\"}, \"params\": [{\"name\": \"similarity_threshold\", \"value\": 0.8}, {\"name\": \"distance_threshold\", \"value\": 2}], \"resolve\": {\"scale\": {\"color\": \"independent\", \"size\": \"independent\", \"y\": \"shared\"}}, \"title\": {\"text\": \"Heatmaps of string comparison metrics\", \"anchor\": \"middle\", \"fontSize\": 16}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.9.3.json\"}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.HConcatChart(...)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sa.comparator_score_threshold_chart(\n",
    "    data, \"string1\", \"string2\", distance_threshold=2, similarity_threshold=0.8\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To class our variations on \"Richard\" in the same `Comparison Level`, a good choice of metric could be Jaro-Winkler with a threshold of 0.8. Lowering the threshold any more could increase the chances for false positives. \n",
    "\n",
    "For example, consider a single Jaro-Winkler `Comparison Level` threshold of 0.7 would lead to \"Rachael\" being considered as providing the same amount evidence for a record matching as \"iRchard\".\n",
    "\n",
    "An alternative way around this is to construct a `Comparison` with multiple levels, each corresponding to a different threshold of Jaro-Winkler similarity. For example, below we construct a `Comparison` using the `Comparison Library` function [JaroWinklerAtThresholds](../../api_docs/comparison_library.md#splink.comparison_library.JaroWinklerAtThresholds) with multiple levels for different match thresholds.:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import splink.comparison_library as cl\n",
    "\n",
    "first_name_comparison = cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.8, 0.7])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we print this comparison as a dictionary we can see the underlying SQL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'output_column_name': 'first_name',\n",
       " 'comparison_levels': [{'sql_condition': '\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL',\n",
       "   'label_for_charts': 'first_name is NULL',\n",
       "   'is_null_level': True},\n",
       "  {'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n",
       "   'label_for_charts': 'Exact match on first_name'},\n",
       "  {'sql_condition': 'jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9',\n",
       "   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.9'},\n",
       "  {'sql_condition': 'jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.8',\n",
       "   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.8'},\n",
       "  {'sql_condition': 'jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.7',\n",
       "   'label_for_charts': 'Jaro-Winkler distance of first_name >= 0.7'},\n",
       "  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
       " 'comparison_description': 'JaroWinklerAtThresholds'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_name_comparison.get_comparison(\"duckdb\").as_dict()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Where:  \n",
    "\n",
    "* Exact Match level will catch perfect matches (\"Richard\").  \n",
    "* The 0.9 threshold will catch Shortenings and Typos (\"ichard\", \"Richar\", \"iRchard\", \"Richadr\",  \"Rich\").  \n",
    "* The 0.8 threshold will catch simple Nicknames/Aliases (\"Rick\", \"Rico\").  \n",
    "* The 0.7 threshold will catch more complex Nicknames/Aliases (\"Ricky\"), but will also include less relevant names (e.g. \"Rachael\"). However, this should not be a concern as the model should give less predictive power (i.e. Match Weight) to this level of evidence.  \n",
    "* All other comparisons will end up in the \"Else\" level  "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Phonetic Matching \n",
    "\n",
    "There are similar functions available within splink to help users get familiar with phonetic transformations. You can create similar visualisations to string comparators.\n",
    "\n",
    "To see the phonetic transformations for a single string, there is the `phonetic_transform` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'soundex': 'R02063', 'metaphone': 'RXRT', 'dmetaphone': ('RXRT', 'RKRT')}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sa.phonetic_transform(\"Richard\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'soundex': 'S30105', 'metaphone': 'STFN', 'dmetaphone': ('STFN', '')}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sa.phonetic_transform(\"Steven\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now consider a collection of common variations of the name \"Stephen\". Which phonetic transforms will consider these as sufficiently similar to \"Stephen\"?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string1</th>\n",
       "      <th>string2</th>\n",
       "      <th>error_type</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stephen</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Steven</td>\n",
       "      <td>Spelling Variation</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stephan</td>\n",
       "      <td>Spelling Variation/Similar Name</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Steve</td>\n",
       "      <td>Nickname/Alias</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stehpen</td>\n",
       "      <td>Transposition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>tSephen</td>\n",
       "      <td>Transposition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stephne</td>\n",
       "      <td>Transposition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stphen</td>\n",
       "      <td>Deletion</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stepheb</td>\n",
       "      <td>Replacement</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stephanie</td>\n",
       "      <td>Different Name</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Richard</td>\n",
       "      <td>Different Name</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    string1    string2                       error_type\n",
       "0   Stephen    Stephen                             None\n",
       "1   Stephen     Steven               Spelling Variation\n",
       "2   Stephen    Stephan  Spelling Variation/Similar Name\n",
       "3   Stephen      Steve                   Nickname/Alias\n",
       "4   Stephen    Stehpen                    Transposition\n",
       "5   Stephen    tSephen                    Transposition\n",
       "6   Stephen    Stephne                    Transposition\n",
       "7   Stephen     Stphen                         Deletion\n",
       "8   Stephen    Stepheb                      Replacement\n",
       "9   Stephen  Stephanie                   Different Name\n",
       "10  Stephen    Richard                   Different Name"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = [\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Stephen\", \"error_type\": \"None\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Steven\", \"error_type\": \"Spelling Variation\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Stephan\", \"error_type\": \"Spelling Variation/Similar Name\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Steve\", \"error_type\": \"Nickname/Alias\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Stehpen\", \"error_type\": \"Transposition\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"tSephen\", \"error_type\": \"Transposition\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Stephne\", \"error_type\": \"Transposition\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Stphen\", \"error_type\": \"Deletion\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Stepheb\", \"error_type\": \"Replacement\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Stephanie\", \"error_type\": \"Different Name\"},\n",
    "    {\"string1\": \"Stephen\", \"string2\": \"Richard\", \"error_type\": \"Different Name\"},\n",
    "]\n",
    "\n",
    "\n",
    "df = pd.DataFrame(data)\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `phonetic_match_chart` function allows you to compare two lists of strings and how similar the elements are according to the available string similarity and distance metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<style>\n",
       "  #altair-viz-a67fb15ca9dd434088abec82fbed842c.vega-embed {\n",
       "    width: 100%;\n",
       "    display: flex;\n",
       "  }\n",
       "\n",
       "  #altair-viz-a67fb15ca9dd434088abec82fbed842c.vega-embed details,\n",
       "  #altair-viz-a67fb15ca9dd434088abec82fbed842c.vega-embed details summary {\n",
       "    position: relative;\n",
       "  }\n",
       "</style>\n",
       "<div id=\"altair-viz-a67fb15ca9dd434088abec82fbed842c\"></div>\n",
       "<script type=\"text/javascript\">\n",
       "  var VEGA_DEBUG = (typeof VEGA_DEBUG == \"undefined\") ? {} : VEGA_DEBUG;\n",
       "  (function(spec, embedOpt){\n",
       "    let outputDiv = document.currentScript.previousElementSibling;\n",
       "    if (outputDiv.id !== \"altair-viz-a67fb15ca9dd434088abec82fbed842c\") {\n",
       "      outputDiv = document.getElementById(\"altair-viz-a67fb15ca9dd434088abec82fbed842c\");\n",
       "    }\n",
       "    const paths = {\n",
       "      \"vega\": \"https://cdn.jsdelivr.net/npm/vega@5?noext\",\n",
       "      \"vega-lib\": \"https://cdn.jsdelivr.net/npm/vega-lib?noext\",\n",
       "      \"vega-lite\": \"https://cdn.jsdelivr.net/npm/vega-lite@5.17.0?noext\",\n",
       "      \"vega-embed\": \"https://cdn.jsdelivr.net/npm/vega-embed@6?noext\",\n",
       "    };\n",
       "\n",
       "    function maybeLoadScript(lib, version) {\n",
       "      var key = `${lib.replace(\"-\", \"\")}_version`;\n",
       "      return (VEGA_DEBUG[key] == version) ?\n",
       "        Promise.resolve(paths[lib]) :\n",
       "        new Promise(function(resolve, reject) {\n",
       "          var s = document.createElement('script');\n",
       "          document.getElementsByTagName(\"head\")[0].appendChild(s);\n",
       "          s.async = true;\n",
       "          s.onload = () => {\n",
       "            VEGA_DEBUG[key] = version;\n",
       "            return resolve(paths[lib]);\n",
       "          };\n",
       "          s.onerror = () => reject(`Error loading script: ${paths[lib]}`);\n",
       "          s.src = paths[lib];\n",
       "        });\n",
       "    }\n",
       "\n",
       "    function showError(err) {\n",
       "      outputDiv.innerHTML = `<div class=\"error\" style=\"color:red;\">${err}</div>`;\n",
       "      throw err;\n",
       "    }\n",
       "\n",
       "    function displayChart(vegaEmbed) {\n",
       "      vegaEmbed(outputDiv, spec, embedOpt)\n",
       "        .catch(err => showError(`Javascript Error: ${err.message}<br>This usually means there's a typo in your chart specification. See the javascript console for the full traceback.`));\n",
       "    }\n",
       "\n",
       "    if(typeof define === \"function\" && define.amd) {\n",
       "      requirejs.config({paths});\n",
       "      require([\"vega-embed\"], displayChart, err => showError(`Error loading script: ${err.message}`));\n",
       "    } else {\n",
       "      maybeLoadScript(\"vega\", \"5\")\n",
       "        .then(() => maybeLoadScript(\"vega-lite\", \"5.17.0\"))\n",
       "        .then(() => maybeLoadScript(\"vega-embed\", \"6\"))\n",
       "        .catch(showError)\n",
       "        .then(() => displayChart(vegaEmbed));\n",
       "    }\n",
       "  })({\"config\": {\"view\": {\"continuousWidth\": 300, \"continuousHeight\": 300}}, \"layer\": [{\"mark\": {\"type\": \"rect\"}, \"encoding\": {\"color\": {\"field\": \"match\", \"legend\": {\"labelExpr\": \"{'true': 'Match', 'false': 'Non-match'}[datum.label]\", \"labelFontWeight\": \"bold\", \"symbolSize\": 1000, \"title\": null}, \"scale\": {\"range\": [\"lightgray\", \"lightgreen\"]}, \"type\": \"nominal\"}, \"x\": {\"axis\": {\"labelAlign\": \"center\", \"labelAngle\": -10, \"labelFontWeight\": \"bold\", \"orient\": \"top\"}, \"field\": \"phonetic\", \"title\": null, \"type\": \"nominal\"}, \"y\": {\"axis\": {\"titleFontSize\": 14}, \"field\": \"strings_to_compare\", \"title\": \"String comparison\", \"type\": \"ordinal\"}}}, {\"mark\": {\"type\": \"text\", \"baseline\": \"bottom\"}, \"encoding\": {\"opacity\": {\"condition\": {\"test\": \"datum.match\", \"value\": 1}, \"value\": 0.5}, \"text\": {\"field\": \"transform\", \"type\": \"nominal\"}, \"x\": {\"axis\": {\"labelFontSize\": 12}, \"field\": \"phonetic\", \"type\": \"ordinal\"}, \"y\": {\"field\": \"strings_to_compare\", \"type\": \"ordinal\"}}}], \"data\": {\"name\": \"data-phonetic\"}, \"datasets\": {\"data-phonetic\": \"[{\\\"strings_to_compare\\\":\\\"Stephen, Stephen\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"STFN\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Steven\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"STFN\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Stephan\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"STFN\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Steve\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"STF\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, 
Stehpen\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"STPN\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, tSephen\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"TSFN\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stephne\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"STFN\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Stphen\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"STFN\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Stepheb\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"STFP\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stephanie\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"STFN\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Richard\\\",\\\"phonetic\\\":\\\"metaphone\\\",\\\"transform\\\":[\\\"STFN\\\",\\\"RXRT\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stephen\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"STFN\\\",\\\"\\\"]],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Steven\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"STFN\\\",\\\"\\\"]],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Stephan\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"STFN\\\",\\\"\\\"]],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Steve\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"STF\\\",\\\"\\\"]],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stehpen\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"STPN\\\",\\\"\\\"]],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, 
tSephen\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"TSFN\\\",\\\"\\\"]],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stephne\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"STFN\\\",\\\"\\\"]],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Stphen\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"STFN\\\",\\\"\\\"]],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Stepheb\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"STFP\\\",\\\"\\\"]],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stephanie\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"STFN\\\",\\\"\\\"]],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Richard\\\",\\\"phonetic\\\":\\\"dmetaphone\\\",\\\"transform\\\":[[\\\"STFN\\\",\\\"\\\"],[\\\"RXRT\\\",\\\"RKRT\\\"]],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stephen\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"S30105\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Steven\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"S30105\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Stephan\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"S30105\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, Steve\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"S3010\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stehpen\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"S30105\\\"],\\\"match\\\":true},{\\\"strings_to_compare\\\":\\\"Stephen, 
tSephen\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"t50105\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stephne\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"S301050\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stphen\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"S3105\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stepheb\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"S30101\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Stephanie\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"S301050\\\"],\\\"match\\\":false},{\\\"strings_to_compare\\\":\\\"Stephen, Richard\\\",\\\"phonetic\\\":\\\"soundex\\\",\\\"transform\\\":[\\\"S30105\\\",\\\"R02063\\\"],\\\"match\\\":false}]\"}, \"height\": {\"step\": 40}, \"title\": {\"text\": \"Heatmap of Phonetic Matches\", \"anchor\": \"middle\", \"fontSize\": 16}, \"width\": {\"step\": 70}, \"$schema\": \"https://vega.github.io/schema/vega-lite/v5.9.3.json\"}, {\"mode\": \"vega-lite\"});\n",
       "</script>"
      ],
      "text/plain": [
       "alt.LayerChart(...)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sa.phonetic_match_chart(data, \"string1\", \"string2\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see that all of the algorithms recognise simple phonetically similar names (\"Stephen\", \"Steven\"). However, there is some variation when it comes to transposition errors (\"Stehpen\", \"Stephne\") with soundex and metaphone-esque giving different results. There is also different behaviour considering different names (\"Stephanie\").\n",
    "\n",
    "Given there is no clear winner that captures all of the similar names, it is recommended that phonetic matches are used as a single `Comparison Level` within in a `Comparison` which also includes [string comparators](./comparators.md) in the other levels. To see an example of this, see the [Combining String scores and Phonetic matching](#combining-string-scores-and-phonetic-matching) section of this topic guide."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you would prefer the underlying dataframe instead of the chart, there is the `phonetic_transform_df` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>string1</th>\n",
       "      <th>string2</th>\n",
       "      <th>soundex</th>\n",
       "      <th>metaphone</th>\n",
       "      <th>dmetaphone</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stephen</td>\n",
       "      <td>[S30105, S30105]</td>\n",
       "      <td>[STFN, STFN]</td>\n",
       "      <td>[(STFN, ), (STFN, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Steven</td>\n",
       "      <td>[S30105, S30105]</td>\n",
       "      <td>[STFN, STFN]</td>\n",
       "      <td>[(STFN, ), (STFN, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stephan</td>\n",
       "      <td>[S30105, S30105]</td>\n",
       "      <td>[STFN, STFN]</td>\n",
       "      <td>[(STFN, ), (STFN, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Steve</td>\n",
       "      <td>[S30105, S3010]</td>\n",
       "      <td>[STFN, STF]</td>\n",
       "      <td>[(STFN, ), (STF, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stehpen</td>\n",
       "      <td>[S30105, S30105]</td>\n",
       "      <td>[STFN, STPN]</td>\n",
       "      <td>[(STFN, ), (STPN, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>tSephen</td>\n",
       "      <td>[S30105, t50105]</td>\n",
       "      <td>[STFN, TSFN]</td>\n",
       "      <td>[(STFN, ), (TSFN, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stephne</td>\n",
       "      <td>[S30105, S301050]</td>\n",
       "      <td>[STFN, STFN]</td>\n",
       "      <td>[(STFN, ), (STFN, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stphen</td>\n",
       "      <td>[S30105, S3105]</td>\n",
       "      <td>[STFN, STFN]</td>\n",
       "      <td>[(STFN, ), (STFN, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stepheb</td>\n",
       "      <td>[S30105, S30101]</td>\n",
       "      <td>[STFN, STFP]</td>\n",
       "      <td>[(STFN, ), (STFP, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Stephanie</td>\n",
       "      <td>[S30105, S301050]</td>\n",
       "      <td>[STFN, STFN]</td>\n",
       "      <td>[(STFN, ), (STFN, )]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Richard</td>\n",
       "      <td>[S30105, R02063]</td>\n",
       "      <td>[STFN, RXRT]</td>\n",
       "      <td>[(STFN, ), (RXRT, RKRT)]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    string1    string2            soundex     metaphone  \\\n",
       "0   Stephen    Stephen   [S30105, S30105]  [STFN, STFN]   \n",
       "1   Stephen     Steven   [S30105, S30105]  [STFN, STFN]   \n",
       "2   Stephen    Stephan   [S30105, S30105]  [STFN, STFN]   \n",
       "3   Stephen      Steve    [S30105, S3010]   [STFN, STF]   \n",
       "4   Stephen    Stehpen   [S30105, S30105]  [STFN, STPN]   \n",
       "5   Stephen    tSephen   [S30105, t50105]  [STFN, TSFN]   \n",
       "6   Stephen    Stephne  [S30105, S301050]  [STFN, STFN]   \n",
       "7   Stephen     Stphen    [S30105, S3105]  [STFN, STFN]   \n",
       "8   Stephen    Stepheb   [S30105, S30101]  [STFN, STFP]   \n",
       "9   Stephen  Stephanie  [S30105, S301050]  [STFN, STFN]   \n",
       "10  Stephen    Richard   [S30105, R02063]  [STFN, RXRT]   \n",
       "\n",
       "                  dmetaphone  \n",
       "0       [(STFN, ), (STFN, )]  \n",
       "1       [(STFN, ), (STFN, )]  \n",
       "2       [(STFN, ), (STFN, )]  \n",
       "3        [(STFN, ), (STF, )]  \n",
       "4       [(STFN, ), (STPN, )]  \n",
       "5       [(STFN, ), (TSFN, )]  \n",
       "6       [(STFN, ), (STFN, )]  \n",
       "7       [(STFN, ), (STFN, )]  \n",
       "8       [(STFN, ), (STFP, )]  \n",
       "9       [(STFN, ), (STFN, )]  \n",
       "10  [(STFN, ), (RXRT, RKRT)]  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sa.phonetic_transform_df(data, \"string1\", \"string2\")"
   ]
  },
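  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the idea of a phonetic transform more concrete, the sketch below implements the classic four-character American Soundex in plain Python. It is an illustration only: it is not the implementation behind the outputs above, so its codes differ from the `S30105`-style codes in the table, and the `classic_soundex` name is made up for this example.\n",
    "\n",
    "```python\n",
    "def classic_soundex(name):\n",
    "    # Map each consonant to its Soundex digit; vowels, H, W and Y have no digit.\n",
    "    groups = {'BFPV': '1', 'CGJKQSXZ': '2', 'DT': '3', 'L': '4', 'MN': '5', 'R': '6'}\n",
    "    codes = {letter: digit for letters, digit in groups.items() for letter in letters}\n",
    "\n",
    "    chars = [ch for ch in name.upper() if ch.isalpha()]\n",
    "    if not chars:\n",
    "        return ''\n",
    "\n",
    "    result = chars[0]\n",
    "    previous = codes.get(chars[0], '')\n",
    "    for ch in chars[1:]:\n",
    "        digit = codes.get(ch, '')\n",
    "        # Keep a digit only if it differs from the previous coded letter.\n",
    "        if digit and digit != previous:\n",
    "            result += digit\n",
    "        # H and W do not separate repeated codes; vowels do.\n",
    "        if ch not in 'HW':\n",
    "            previous = digit\n",
    "    # Pad with zeros and truncate to one letter plus three digits.\n",
    "    return (result + '000')[:4]\n",
    "\n",
    "\n",
    "for name in ['Stephen', 'Steven', 'Stehpen', 'Richard']:\n",
    "    print(name, classic_soundex(name))\n",
    "```\n",
    "\n",
    "With this sketch, `Stephen`, `Steven` and the transposed `Stehpen` all map to `S315`, while `Richard` maps to `R263`."
   ]
  },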
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Combining String scores and Phonetic matching\n",
    "\n",
    "Once you have considered all of the string comparators and phonetic transforms for a given column, you may decide that you would like to have multiple comparison levels including a combination of options.\n",
    "\n",
    "For this you can construct a custom comparison to catch all of the edge cases you want. For example, if you decide that the comparison for `first_name` in the model should consider:\n",
    "\n",
    "1. A `Dmetaphone` level for phonetic similarity\n",
    "2. A `Levenshtein` level with distance of 2 for typos\n",
    "3. A `Jaro-Winkler` level with similarity 0.8 for fuzzy matching\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Comparison 'CustomComparison' of \"first_name\" and \"first_name_dm\".\n",
      "Similarity is assessed using the following ComparisonLevels:\n",
      "    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n",
      "    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n",
      "    - 'Jaro-Winkler distance of first_name >= 0.9' with SQL rule: jaro_winkler_similarity(\"first_name_l\", \"first_name_r\") >= 0.9\n",
      "    - 'Levenshtein distance of first_name <= 0.8' with SQL rule: levenshtein(\"first_name_l\", \"first_name_r\") <= 0.8\n",
      "    - 'Array intersection size >= 1' with SQL rule: array_length(list_intersect(\"first_name_dm_l\", \"first_name_dm_r\")) >= 1\n",
      "    - 'All other comparisons' with SQL rule: ELSE\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import splink.comparison_library as cl\n",
    "import splink.comparison_level_library as cll\n",
    "first_name_comparison = cl.CustomComparison(\n",
    "    output_column_name=\"first_name\",\n",
    "    comparison_levels=[\n",
    "        cll.NullLevel(\"first_name\"),\n",
    "        cll.ExactMatchLevel(\"first_name\"),\n",
    "        cll.JaroWinklerLevel(\"first_name\", 0.9),\n",
    "        cll.LevenshteinLevel(\"first_name\", 0.8),\n",
    "        cll.ArrayIntersectLevel(\"first_name_dm\", 1),\n",
    "        cll.ElseLevel()\n",
    "    ]\n",
    ")\n",
    "\n",
    "print(first_name_comparison.get_comparison(\"duckdb\").human_readable_description)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where `first_name_dm` refers to a column in the dataset which has been created during the [feature engineering](../data_preparation/feature_engineering.md#phonetic-transformations) step to give the `Dmetaphone` transform of `first_name`."
   ]
  }
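  ,
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The levels of a comparison are checked from top to bottom, and each record pair is assigned to the first level whose rule holds (the final level is the `ELSE` case). The plain-Python sketch below mimics that logic for a cut-down version of the comparison above. It is an illustration under stated assumptions, not Splink code: `levenshtein` and `first_name_level` are hypothetical helpers, the Jaro-Winkler level is left out for brevity, and the Dmetaphone codes are copied from the table earlier in this guide with empty secondary codes dropped.\n",
    "\n",
    "```python\n",
    "def levenshtein(a, b):\n",
    "    # Classic dynamic-programming edit distance (insert, delete, substitute).\n",
    "    previous = list(range(len(b) + 1))\n",
    "    for i, char_a in enumerate(a, start=1):\n",
    "        current = [i]\n",
    "        for j, char_b in enumerate(b, start=1):\n",
    "            cost = 0 if char_a == char_b else 1\n",
    "            current.append(\n",
    "                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)\n",
    "            )\n",
    "        previous = current\n",
    "    return previous[-1]\n",
    "\n",
    "\n",
    "def first_name_level(name_l, name_r, dm_l, dm_r):\n",
    "    # Rules are checked in order; the first one that holds is returned.\n",
    "    if name_l is None or name_r is None:\n",
    "        return 'null'\n",
    "    if name_l == name_r:\n",
    "        return 'exact match'\n",
    "    if levenshtein(name_l, name_r) <= 2:\n",
    "        return 'levenshtein <= 2'\n",
    "    if len(set(dm_l) & set(dm_r)) >= 1:\n",
    "        return 'dmetaphone overlap'\n",
    "    return 'else'\n",
    "\n",
    "\n",
    "print(first_name_level('Stephen', 'Steven', ['STFN'], ['STFN']))\n",
    "print(first_name_level('Stephen', 'Stephanie', ['STFN'], ['STFN']))\n",
    "print(first_name_level('Stephen', 'Richard', ['STFN'], ['RXRT', 'RKRT']))\n",
    "```\n",
    "\n",
    "Here `Stephen`/`Steven` is caught by the Levenshtein level, `Stephen`/`Stephanie` (edit distance 3) falls through to the Dmetaphone level, and `Stephen`/`Richard` ends up in the else level."
   ]
  }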
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
